Semi-Automatic Labeling of Training Data Sets in Text Classification

Authors

  • Nayereh Ghahreman
  • Ahmad Baraani-Dastjerdi
Abstract

The Web includes digital libraries and billions of text documents, and fast, simple search through this sizeable collection is important for users and researchers. Because manual or rule-based document classification is difficult and time-consuming, automatic classification systems are needed. Automatic text classification systems, in turn, demand extensive and appropriate training data sets. To provide these data sets, experts usually label numerous unlabeled documents by hand. This manual labeling is itself difficult and time-consuming, and human exhaustion and carelessness make mistakes possible. This study proposes semi-automatic creation of the training data set: only a small percentage of the documents in this extensive set is labeled manually, and the remainder is labeled automatically. Results show that by manually labeling only ten percent of the training set, the remaining documents can be labeled automatically with 98 percent accuracy. This reduction in accuracy is noticeable only on standard data sets; for large practical data sets, it is trivial compared with the accuracy reduction caused by human exhaustion and carelessness.
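
The general workflow described in the abstract (label a small seed by hand, then label the rest automatically) can be sketched as follows. This is only a minimal illustration, not the authors' method: the abstract does not say which classifier or features were used, so the multinomial Naive Bayes model, the whitespace tokenizer, the toy documents, and all function names here are assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_doc_counts, word_counts, vocabulary


def predict(model, text):
    """Return the most probable label under add-one (Laplace) smoothing."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_doc_counts):
        score = math.log(class_doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def semi_automatic_labeling(manually_labeled, unlabeled):
    """Train on the small manual seed, then label the remaining documents."""
    model = train_naive_bayes(manually_labeled)
    return [(text, predict(model, text)) for text in unlabeled]


# Hypothetical toy data: two hand-labeled documents, two unlabeled ones.
seed = [
    ("the team won the football match", "sports"),
    ("the election results were announced", "politics"),
]
rest = ["football match tonight", "election campaign begins"]
auto_labeled = semi_automatic_labeling(seed, rest)
```

In a realistic setting the seed would be roughly the ten percent of documents labeled by experts, and the automatically assigned labels would be checked against a held-out, manually labeled sample to estimate accuracy.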

Similar resources

Automatic Interpretation of UltraCam Imagery by Combination of Support Vector Machine and Knowledge-based Systems

With the development of digital sensors, an increasing number of high-resolution images are available. Interpreting these images manually is not possible, which calls for practical, fast and automatic solutions to environmental and location-based management problems. Land cover classification using high-resolution imagery is a difficult process because of the c...

Extraction of Training Sets for Experimentation with Cross Language Information Retrieval Systems

In this paper we focus on methods, models and tools for the extraction of bilingual training/test sets useful for the (semi-)automatic classification of textual documents. Such documents could be tutorials, technical specifications, articles, personal notes, etc. Another motivation for our research is the need to manage corpora of classified texts, especially parallel corpora. We...

Automatic recognition of German news focusing on future-directed beliefs and intentions

We consider the classification of German news stories as either focusing on future-directed beliefs and intentions or lacking these. The method proposed in this article requires only a small set of labeled training data. Rather, we introduce German clues for the automatic identification of future-orientation which are used for automatic labeling of Reuters news stories. We describe the developm...

An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification

Due to the exponential growth of electronic texts, organizing and managing them requires a tool that provides users with the information they search for in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing, and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...

Corpus Based Unsupervised Labeling of Documents

Text categorization involves mapping of documents to a fixed set of labels. A similar but equally important problem is that of assigning labels to large corpora. With a deluge of documents from sources like the World Wide Web, manual labeling by domain experts is prohibitively expensive. The problem of reducing effort in labeling of documents has warranted a lot of investigation in the past. Mo...

Journal title:
  • Computer and Information Science

Volume 4

Publication date: 2011